In [1]:
# Automatically reload imported modules before running code, so stale versions (including cached *.pyc files) aren't executed, just in case.
%load_ext autoreload
%autoreload 2
There are a few ways of opening text files, reading them, and parsing their contents. I'll use one of our CO2 frequency outputs as an example.
In [2]:
import os
In [3]:
os.listdir()
Out[3]:
In [4]:
help(os.listdir)
In [5]:
os.listdir(path="../qm_files")
Out[5]:
In [6]:
filename = "../qm_files/drop_0375_0qm_0mm.out"
print(filename)
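If you'd rather not type a path by hand, here's a rough sketch of picking out only the .out files from a directory listing. The directory and file names below are made up (they're created in a temporary directory so the cell runs anywhere); they aren't our real outputs.

```python
import os
import tempfile

# Build a throwaway directory with made-up file names so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("a.out", "b.out", "notes.txt"):
        with open(os.path.join(tmpdir, name), "w") as f:
            f.write("placeholder\n")

    # os.listdir returns bare names in arbitrary order, so sort them and
    # join each one back onto the directory to get a usable path.
    out_files = sorted(
        os.path.join(tmpdir, name)
        for name in os.listdir(tmpdir)
        if name.endswith(".out")
    )
    out_names = [os.path.basename(path) for path in out_files]

print(out_names)
```

Joining with os.path.join matters because os.listdir only gives back names, not full paths.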
Python has a built-in function called open that takes a filename as a string and returns a handle to it that we can work with, so it can be read, looped over, and closed.
In [7]:
help(open)
In [8]:
handle = open(filename)
print(handle)
In [9]:
print(type(handle))
These will say something slightly different in Python 2 vs. 3, but we work with them in exactly the same way. Here are all the methods that are defined on the handle to our file.
In [10]:
dir(handle)
Out[10]:
This list of strings represents all the methods or member variables that can be called on the handle.
If our file handle is called handle, and we see name is a member of that list, we want to know what it does.
In [11]:
help(handle.name)
In [12]:
type(handle.name)
Out[12]:
In [13]:
handle.name
Out[13]:
In [14]:
handle.name()
So, through a little experimentation, we've figured out that it isn't a function, it's a variable (a string holding the filename). If it were a function, we'd be able to call it like above; instead, calling it raises a TypeError.
You're probably wondering what all the names that begin with __ or _ are. These are methods or member variables that aren't meant to be used directly by the user; they're for "under-the-hood" operations only. Let's look only at the parts of handle we're supposed to use.
In [15]:
print([m for m in dir(handle) if m[0] != '_'])
We can look at the type of all these too:
In [16]:
for m in dir(handle):
    if m[0] != '_':
        # getattr looks up an attribute by name, so there's no need for eval
        print(m, type(getattr(handle, m)))
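Instead of calling each name to see whether it blows up, we could ask Python directly with callable(). This is only a sketch: it opens a small throwaway file (made up for the example) rather than our real output, and sorts the public names into methods and plain attributes.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A made-up file standing in for a real output file.
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "w") as f:
        f.write("hello\n")

    with open(path) as handle:
        # Keep only the names that don't start with an underscore.
        public = [m for m in dir(handle) if not m.startswith("_")]
        # callable() is True for methods and False for plain values.
        methods = [m for m in public if callable(getattr(handle, m))]
        attributes = [m for m in public if not callable(getattr(handle, m))]

print("methods:", methods)
print("attributes:", attributes)
```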
Since we're interested in reading from a file, there are a few methods that sound like they can read.
In [17]:
help(handle.readable)
In [18]:
handle.readable()
Out[18]:
We can read from the file. I should hope so, we just opened it!
In [19]:
help(handle.read)
Well that isn't very helpful...let's look at the official documentation.
In [20]:
# This will let us do some neat stuff with the notebook, like embed webpages and videos.
import IPython
In [21]:
website = "https://docs.python.org/3.5/tutorial/inputoutput.html#reading-and-writing-files"
IPython.lib.display.IFrame(website, width=800, height=800)
Out[21]:
After a bit of light reading, it looks like we can do contents = handle.read() and contents will be a giant string that contains all of the file contents. Only one way to find out...
In [22]:
contents = handle.read()
In [23]:
print(contents)
In [24]:
print(len(contents))
Can you see why we don't normally print entire files to the screen? This isn't even a big one.
I wonder if the handle is still readable...
In [25]:
handle.readable()
Out[25]:
What if I try and read from it again?
In [26]:
contents2 = handle.read()
print(contents2)
Nothing! So, the end of the file's been reached, and we might as well close it, since we'll be working with contents, not the file (handle) itself.
In [27]:
handle.close()
handle.closed
Out[27]:
Just to reiterate, here's what we actually did to open a file, read it into a string, then close it:
In [28]:
handle = open(filename)
contents = handle.read()
handle.close()
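About that empty second read: a file handle keeps track of its position in the file, and read() leaves that position at the end. Reopening the file, as above, starts over from the beginning. Another option is seek(0), which moves the position back to the start. Here's a sketch using a small made-up file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A made-up file standing in for a real output file.
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "w") as f:
        f.write("line one\nline two\n")

    handle = open(path)
    first_read = handle.read()   # reads everything; the position is now at the end
    second_read = handle.read()  # nothing left to read, so this is an empty string
    handle.seek(0)               # move the position back to the start
    third_read = handle.read()   # the full contents again
    handle.close()

print(repr(second_read))
print(first_read == third_read)
```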
There were a few other methods that we could call on our handle that had to do with reading, specifically handle.readline() and handle.readlines().
readline will read a single line from an open file handle, up to and including the newline (which is the character '\n'). Basically, every time you see a linebreak or hit return, this invisible character is present.
In [29]:
handle = open(filename)
first_line = handle.readline()
second_line = handle.readline()
print(first_line)
print(second_line)
handle.close()
Notice the blank line after each printed line. Each string returned by readline() still ends with its own newline, and print() adds another newline by default.
In [30]:
help(print)
In [31]:
print(first_line, end='')
print(second_line, end='')
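Another way to deal with the extra newline is to strip it from the string itself rather than changing how print() behaves. A small sketch, using a made-up string in place of a real line from the file:

```python
line = "first line\n"  # made-up stand-in for a line returned by readline()

print(repr(line))             # repr() makes the trailing '\n' visible
stripped = line.rstrip("\n")  # drop only the trailing newline
print(stripped)
```

Stripping is usually what you want when the line is going to be parsed rather than printed.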
handle.readlines() does the same thing as readline() looped over the entire file, so it returns a list of strings.
In [32]:
handle = open(filename)
contents3 = handle.readlines()
handle.close()
In [33]:
print(contents3[:10])
In [34]:
contents3[:10]
Out[34]:
There's another convenient method for strings that lets us do the same thing to a large string; rather than call split(), which by default splits on any whitespace, we call splitlines(), which splits on line boundaries.
In [35]:
contents.splitlines()[:10]
Out[35]:
Notice that the newlines have been removed in this case. Hopefully that doesn't bite us in the future; it may or may not be important for what we do.
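If the newlines do turn out to matter, splitlines() takes a keepends argument. Here's a sketch comparing the two approaches on a made-up string, with io.StringIO standing in for an open file:

```python
import io

text = "alpha\nbeta\ngamma\n"  # made-up file contents

# io.StringIO gives us a file-like object without touching the disk.
from_readlines = io.StringIO(text).readlines()

from_splitlines = text.splitlines()          # newlines removed
with_ends = text.splitlines(keepends=True)   # newlines kept

print(from_splitlines)
print(with_ends == from_readlines)
```

One caveat: splitlines() also splits on a few less common line-boundary characters, so the two can differ on unusual input.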
I think I've shown all the ways contents of files can be read, but what about the opening and closing? There's an easier way, one where the file will be closed for us automatically.
In [36]:
with open(filename) as handle2:
    contents4 = handle2.read()
In [37]:
handle2.closed
Out[37]:
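This is what we call "syntactic sugar": a more convenient way of writing something we could already do. Using either is fine. I personally prefer the with statement, because the file is closed as soon as the block ends, even if an error occurs inside it. If you open a bunch of files and forget to close them, over and over again, each one keeps holding on to system resources, and eventually things might get slow or you might hit the operating system's limit on open files.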
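To make the comparison concrete, here's roughly what the with statement saves us from writing. This is a sketch of the idea, not the exact internals, and it uses a small made-up file so the cell runs anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # A made-up file standing in for a real output file.
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "w") as f:
        f.write("some contents\n")

    # The manual version: try/finally guarantees close() runs
    # even if read() raises an error.
    handle = open(path)
    try:
        contents_manual = handle.read()
    finally:
        handle.close()
    closed_manual = handle.closed

    # The with version: the same guarantee in two lines.
    with open(path) as handle2:
        contents_with = handle2.read()
    closed_with = handle2.closed

print(closed_manual, closed_with)
print(contents_manual == contents_with)
```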
Just like everything else in Python, there are a couple of ways to go through a file line by line. We can either loop over the contents of the file we've stored in our contents variable, or we can loop over the file directly. Yes, file handles are iterable, just like lists and tuples!
Here's directly looping over the file:
In [38]:
with open(filename) as handle:
    for line in handle:
        if 'Albrecht' in line:
            print(line)
        if 'Berquist' in line:
            print(line)
        if 'Lambrecht' in line:
            print(line)
and here's looping over our stored variable:
In [39]:
for line in contents.splitlines():
    if 'time' in line:
        print(line)
Notice that I called contents.splitlines(); this way, we make a list of strings, so iterating will give us one string at a time.
We can't loop over contents directly. Why not?
In [40]:
for line in contents[2000:2500]:
    print(line)
contents is a string; iterating over a string will give you its characters. Clearly this is nonsense. Be careful!
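One more note on the name search from earlier: with three separate if statements, a line containing two of the names gets printed twice. Here's a sketch of one way around that using any(). The lines are made up, with io.StringIO standing in for the open file.

```python
import io

# Made-up lines standing in for the real output file.
sample = io.StringIO(
    "authors: Albrecht, Berquist\n"
    "no names here\n"
    "also Lambrecht\n"
)

names = ("Albrecht", "Berquist", "Lambrecht")
matches = []
for line in sample:
    # any() is True as soon as one of the names is found in the line,
    # so each matching line is kept exactly once.
    if any(name in line for name in names):
        matches.append(line.rstrip("\n"))

print(matches)
```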
Now we can use our file opening/closing/looping knowledge to extract useful information from files.
Let's say I want to extract all of the vibrational frequencies from an output file, and store them in a list called frequencies as floating-point numbers.
The key to extracting information from QM outputs (or any text file, really) is to understand the context in which the information appears. What's the file structured like? How do we actually get the information we want?
I'll split the contents on newlines to make it easier to work with.
In [41]:
contents_splitlines = contents.splitlines()
The information I want occurs near the end.
In [42]:
contents_splitlines[370:]
Out[42]:
We can see the frequencies all occur on a single line:
' Frequency: 621.29 1410.25 2498.02',
and maybe we can check for whether Frequency: is in a line to get frequencies. First, we need a place to store our results.
Check all the lines for a match, and print out a match if it exists:
In [43]:
for line in contents_splitlines:
    if 'Frequency:' in line:
        print(line)
It worked! You'll have to take my word for it that if this occurred on multiple lines of an output (say, if there were more than 3 vibrational frequencies), this would catch every instance. That's the beauty of looping.
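Actually, you don't have to take my word for it. Here's a sketch with a made-up list of lines containing two Frequency: lines. The numbers and the layout are hypothetical, not taken from a real output.

```python
# Made-up lines; the values and surrounding text are hypothetical.
sample_lines = [
    " some header text",
    " Frequency:   100.00   200.00   300.00",
    " some other text",
    " Frequency:   400.00   500.00",
]

matched = []
for line in sample_lines:
    if "Frequency:" in line:
        print(line)
        matched.append(line)

print(len(matched))
```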
The only problem with the code above is that once we match the line, we don't actually do anything with it other than print it. Let's try storing it in a variable s so we can manipulate it after our loop is complete.
In [44]:
for line in contents_splitlines:
    if 'Frequency:' in line:
        s = line
In [45]:
print(s)
Now, to make it into a list of floats:
In [46]:
s.split()
Out[46]:
In [47]:
s.split()[1:]
Out[47]:
In [48]:
map(float, s.split()[1:])
Out[48]:
In [49]:
list(map(float, s.split()[1:]))
Out[49]:
Ok, now we know how to turn a frequency line into a list of numbers we can work with in some other piece of code.
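If map() looks unfamiliar, a list comprehension does the same job and some people find it easier to read. A quick sketch, using the frequency line shown earlier:

```python
s = " Frequency: 621.29 1410.25 2498.02"  # the frequency line shown earlier

# Skip the first token ('Frequency:') and convert the rest to floats.
with_map = list(map(float, s.split()[1:]))
with_comprehension = [float(token) for token in s.split()[1:]]

print(with_comprehension)
print(with_map == with_comprehension)
```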
But, now we have another problem.
What if we have more than one line that contains Frequency:? This will only catch the very last one! We need to do all this work inside the loop.
In [50]:
frequencies = []
for line in contents_splitlines:
    if 'Frequency:' in line:
        frequencies_oneline = list(map(float, line.split()[1:]))
        frequencies.extend(frequencies_oneline)
In [51]:
print(frequencies)
We can't use list.append(); that would append a list to a list, so we'd end up with a list-of-lists. We just want a single list, so we extend() it.
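To see the difference between the two, and to wrap the loop up for reuse, here's a sketch. The function name parse_frequencies and the sample lines are made up for illustration; they aren't part of any library or of our real output.

```python
def parse_frequencies(lines):
    """Collect every number on lines containing 'Frequency:' into one flat list."""
    frequencies = []
    for line in lines:
        if "Frequency:" in line:
            # extend() adds each number individually to the list.
            frequencies.extend(float(token) for token in line.split()[1:])
    return frequencies


# Made-up lines with hypothetical values.
sample_lines = [
    " Frequency: 100.0 200.0 300.0",
    " some other line",
    " Frequency: 400.0 500.0",
]

flat = parse_frequencies(sample_lines)

# For comparison: append() adds each per-line list as a single element.
nested = []
for line in sample_lines:
    if "Frequency:" in line:
        nested.append([float(token) for token in line.split()[1:]])

print(flat)
print(nested)
```

Since the function takes any iterable of strings, it should work equally well on contents.splitlines() or on an open file handle.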